The early sections of this paper present an analysis of a Markov decision model that is known as the multi-armed bandit under the assumption that the utility function of the decision maker is either linear or exponential. The analysis includes efficient procedures for computing the expected utility associated with the use of a priority policy and for identifying a priority policy that is optimal. The methodology in these sections is novel, building on the use of elementary row operations. In the later sections of this paper, the analysis is adapted to accommodate constraints that link the bandits.



1 Introduction

The colorfully-named multi-armed bandit [10] is the following Markov decision problem: At epochs 1, 2, ..., a decision maker observes the current state of each of several Markov chains with rewards (bandits) and plays one of them. The Markov chains that are not played remain in their current states. The Markov chain that is played evolves for one transition according to its transition probabilities, earning an immediate reward (possibly negative) that can depend upon its current state and on the state to which transition occurs. Henceforth, to distinguish the states of the individual Markov chains from those of the Markov decision problem, the latter are called multi-states; each multi-state prescribes a state for each of the Markov chains.

A key result for the multi-armed bandit is that attention can be restricted to a simple class of decision procedures that are based on "labelings." A labeling is an assignment of a number to each state of each bandit such that no two states have the same number (label), even if they are in different bandits. A priority rule is a policy that is determined by a labeling in this way: given each multi-state, the priority rule plays the Markov chain whose current state has the lowest label. In a seminal 1974 paper, Gittins and Jones [12] (followed by [10]) demonstrated the optimality of a priority rule for a model whose objective is to maximize expected discounted income with a per-period discount factor $c$ having $0 < c < 1$. The (optimal) priorities that they identified are based on a family of stopping times, one for each state of each chain. Given state $i$ of bandit $k$, the decision maker is imagined to play bandit $k$ for any number $\tau$ ($\tau \ge 1$) of consecutive epochs, observing the state to which each transition occurs, and stopping whenever he or she wishes to do so. The discounted present value of the (random) income stream that is received during epochs 1 through $\tau$ is denoted $X(\tau)$. The stopping times $\tau$ for state $i$ are used to assign that state an index $I(i)$ by

$$I(i) = \max_{\tau} \frac{E[X(\tau)]}{1 - E[c^{\tau}]}. \qquad (1.1)$$

It was demonstrated in [12], [10] that, given each multi-state, it is optimal to play any Markov chain (bandit) whose current state has the largest index (lowest label).

Following [12], [10], the multi-armed bandit problem has stimulated research in control theory, economics, probability, and operations research. A sampling of noteworthy papers includes Bergemann and Välimäki [2], Bertsimas and Niño-Mora [4], El Karoui and Karatzas [8], Katehakis and Veinott [15], Schlag [17], Sonin [18], Tsitsiklis [19], Varaiya, Walrand and Buyukkoc [20], Weber [22], and Whittle [23]. Books on the subject (that list many references) include Berry and Fristedt [3], Gittins [11], and Gittins, Glazebrook and Weber [13]. The last and most recent of these books provides a status report on the multi-armed bandit that is almost up-to-date.

An implication of the analysis in [12], [10] is that the largest of all of the indices equals the maximum over all states of the ratio $r(i)/(1 - c)$, where $r(i)$ denotes the expectation of the reward that is earned if state $i$'s bandit is played once while state $i$ is observed and where $c$ is the discount factor. In 1994, Tsitsiklis [19] observed that repeated play of a bandit while it is in the state $i$ whose ratio is largest leads to a multi-armed bandit with one fewer state and random transition times.

In 2007, Denardo, Park and Rothblum [6] considered a generalization of the classic multi-armed bandit model with the following new features:

• The utility function of the decision maker can be exponential, expressing sensitivity to risk. 

• In the case of linear utility functions, the assumption that rewards are discounted is replaced by the 
introduction of stopping (which captures discounting). 

The analysis of [6] focused on pairwise comparisons. It avoided the use of stopping times, which had been a common feature of the prior analyses of multi-armed bandits. It relied on linear algebra, rather than on probability theory. It avoided the need to deal, in the more general cases, with ratios that had zeros in their denominators. It included efficient algorithms for computing indices and for identifying an optimal priority rule.

Constraints that link the bandits (for the extension considered in [6]) are dealt with in Sections 7-8 of the current paper. An optimal solution to the multi-armed bandit problem with $W$ constraints is shown to be an initial randomization over $W + 1$ priority rules, each of which is the optimal solution to an unconstrained bandit problem whose rewards are determined by a particular set of prices (multipliers) on the constraints. A column generation algorithm is described for computing such an optimal solution. In each stage, the coefficients of the column that enters the basis are found by the application of the policy evaluation procedure of Section 4.

As concerns contributions to methodology, the analysis in the earlier sections of this paper rests on elementary row operations. Row operations are used in Sections 3-4 to present the first efficient algorithm for computing the expected utility gained when beginning at a given multi-state and using any given priority rule (solving the optimality equation is inefficient, as the number of multi-states can be enormous). Row operations are also used in Sections 5-6 to determine efficiently an optimal priority policy and to provide a proof of its optimality. The approach in the current paper builds on that of [6], but simplifies the theoretical development and the computation. In particular, the computational effort that the method requires matches the best existing bound for computing Gittins indices (obtained in [16]; see [13, p. 43]); in fact, the same bound applies to the method developed in [6].

2 The model 

Let $K$ be the number of Markov chains (bandits), and let them be numbered 1 through $K$. Markov chain $k$ has a finite set $N_k$ of states. No loss of generality occurs by assuming, as we do, that the state sets of distinct Markov chains are disjoint. Thus, each state $j$ identifies the Markov chain $b(j)$ of which it is a member, i.e., $j \in N_{b(j)}$. The set of all states of all bandits is given by $N = N_1 \cup \cdots \cup N_K$.

If bandit $k$ is played while its state is $i$, this bandit experiences transition to state $j$ with probability $p(i,j)$, and it experiences termination of play with probability $p(i,0)$ given by

$$p(i,0) = 1 - \sum_{j \in N_k} p(i,j) \qquad \forall\, i \in N_k,\ \forall\, k \in \{1, 2, \ldots, K\}.$$

If bandit k is played when its state is i and if transition is to occur to state j, payoff x(i, j) is earned at the 
start of the period; if termination is to occur instead, payoff x(i, 0) is earned at the start of the period. Each 
of these "payoffs" can be positive, negative or zero. 

Termination stops the play of all $K$ bandits, not merely of the bandit that is being played. Termination is modeled as transition to state 0. No action is possible after transition to state 0. For this reason, state 0 is excluded from $N_k$ for each $k$, and hence from $N$.

Each multi-state $s$ is a set that contains, for each $k$, exactly one state in $N_k$. When $s$ is a multi-state, the symbol $s_k$ denotes the state of bandit $k$ that is included in $s$, and the symbol $s \setminus k$ is defined by $s \setminus k = s \setminus \{s_k\}$. Thus, $s \setminus k$ contains all the states in $s$ other than $s_k$. Let $S$ denote the set of all multi-states. Given any multi-state $s$, one of the bandits must be played. Hence, for this model, a stationary nonrandomized policy $\delta$ is any map that for each multi-state $s \in S$ picks a bandit $\delta(s) \in \{1, 2, \ldots, K\}$. Let $\Delta$ denote the set of all stationary nonrandomized policies.

2.1 Utility 

The goal is to maximize expected utility. This will be accomplished with a linear utility function $u(x) = x$, with a risk-averse exponential utility function $u(x) = -e^{-\lambda x}$ where $\lambda$ is a positive constant that is known as the coefficient of risk aversion, and with a risk-seeking exponential utility function $u(x) = e^{\lambda x}$ where $\lambda$ is a positive constant.

All three cases are described and analyzed using the local utility function $h(s, k, v)$ whose value equals the expectation of the total utility that is earned in the (artificially-truncated) one-transition model if multi-state $s$ is observed now, if bandit $k$ is selected now, and if utility $v(t)$ is earned if transition occurs to multi-state $t$.



2.2 Linear utility 

In the case of linear utility, the local utility function is

$$h(s,k,v) = r(s_k) + \sum_{j \in N_k} q(s_k, j)\, v(s \setminus k \cup \{j\}), \qquad (2.1)$$

with data $r(i)$ and $q(i,j)$ that are specified by

$$r(i) = p(i,0)\,x(i,0) + \sum_{j \in N_{b(i)}} p(i,j)\,x(i,j) \qquad \forall\, i \in N, \qquad (2.2)$$

$$q(i,j) = p(i,j) \qquad \forall\, i, j \in N_k,\ \forall\, k \in \{1, 2, \ldots, K\}. \qquad (2.3)$$

Interpret $r(i)$ as the expectation of the reward that is earned immediately if bandit $b(i)$ is played while its state is $i$, and interpret $q(i,j)$ as the probability that bandit $b(i)$ will experience transition to state $j$ given that it is played while its state is $i$. As noted earlier, playing bandit $b(i)$ while its state is $i$ causes termination (rather than transition to some state $j$ in $N_{b(i)}$) with probability $p(i,0)$, which may be positive. The above model captures the classic discounted model, which has transition probability $p(i,j)$ and discount factor $c$ satisfying $0 < c < 1$, by replacing $p(i,0)$ and $p(i,j)$ in (2.2)-(2.3) by $c\,p(i,0)$ and $c\,p(i,j)$, respectively. Incorporating the discount factor into the transition rates yields a fundamental advantage: it facilitates an analysis that applies linear algebraic arguments instead of stopping times.
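As an illustration only, the data in (2.2)-(2.3) for a single bandit might be assembled as sketched below. The function name `linear_data` and the array layout (column 0 for termination, columns 1 through n for the bandit's states) are assumptions made for this sketch, and the optional argument `c` folds a discount factor into the probabilities in the manner described above.

```python
def linear_data(p, x, c=1.0):
    """Return (r, q) for one bandit under linear utility, per (2.2)-(2.3).

    p[i][j], x[i][j]: transition probability and payoff for local state i,
    where column 0 stands for termination and columns 1..n for the states
    (an assumed layout). c multiplies every probability, as in the text's
    reduction of the discounted model.
    """
    n = len(p)
    r = [sum(c * p[i][j] * x[i][j] for j in range(n + 1)) for i in range(n)]
    q = [[c * p[i][j] for j in range(1, n + 1)] for i in range(n)]
    return r, q


# A two-state chain with no termination of its own, discounted at c = 0.9.
p = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
x = [[0.0, 2.0, 4.0],
     [0.0, 0.0, 1.0]]
r, q = linear_data(p, x, c=0.9)
```

In this example every row of `q` sums to 0.9, so the shortfall from 1 plays the role of a termination probability.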

2.3 Exponential utility 

With the risk-averse exponential utility function $u(x) = -e^{-\lambda x}$, one has $u(x + y) = -e^{-\lambda(x+y)} = e^{-\lambda x} u(y)$, and the local utility function is given by (2.1) with data $r(i)$ and $q(i,j)$ that are specified, for each $i \in N$ and $j \in N_{b(i)}$, by

$$r(i) = -p(i,0)\, e^{-\lambda x(i,0)} \quad\text{and}\quad q(i,j) = p(i,j)\, e^{-\lambda x(i,j)}. \qquad (2.4)$$

With the risk-seeking exponential utility function $u(x) = e^{\lambda x}$, the local utility function is given by (2.1) with data

$$r(i) = p(i,0)\, e^{\lambda x(i,0)} \quad\text{and}\quad q(i,j) = p(i,j)\, e^{\lambda x(i,j)}. \qquad (2.5)$$

With all three utility functions, $r(i)$ is called a reward, and $q(i,j)$ is called a transition rate. In the linear-utility model, $q(i,j)$ is a probability. In the risk-averse exponential-utility model, $q(i,j)$ is the product of a probability and a disutility.
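A minimal sketch of (2.4) and (2.5) follows. The function name `exponential_data`, the flag `risk_averse`, and the array layout (column 0 for termination) are assumptions of this sketch rather than notation from the paper.

```python
import math


def exponential_data(p, x, lam, risk_averse=True):
    """Return (r, q) for one bandit under exponential utility.

    risk_averse=True  follows (2.4): r = -p(i,0) e^{-lam x(i,0)}, q = p e^{-lam x}.
    risk_averse=False follows (2.5): r =  p(i,0) e^{+lam x(i,0)}, q = p e^{+lam x}.
    Column 0 of p and x stands for termination (assumed layout).
    """
    n = len(p)
    sign = -1.0 if risk_averse else 1.0
    r = [sign * p[i][0] * math.exp(sign * lam * x[i][0]) for i in range(n)]
    q = [[p[i][j] * math.exp(sign * lam * x[i][j]) for j in range(1, n + 1)]
         for i in range(n)]
    return r, q


# One state that terminates with probability 1/2 and otherwise loses 1 unit.
r, q = exponential_data([[0.5, 0.5]], [[1.0, -1.0]], lam=1.0)
```

In the example the single transition rate is $0.5\,e \approx 1.36$, which shows that a transition rate need not be a probability in the exponential cases.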

Bandit $k$ has an $|N_k| \times |N_k|$ matrix $q^k$ whose $ij$-th entry equals $q(i,j)$ for each ordered pair $(i,j)$ of states in $N_k$. In each case, the entries in $q^k$ are nonnegative. In the linear-utility case, $q^k$ is substochastic, which is to say that its entries are nonnegative and the entries in each row sum to 1 or less. In the risk-averse exponential case, each state $i$ has reward $r(i) \le 0$. In the risk-seeking exponential case, each state $i$ has reward $r(i) \ge 0$.



2.4 A hypothesis 

A square matrix $Q$ is transient if and only if each entry in the matrix $Q^t$ approaches 0 as $t \to \infty$. A hypothesis that is shared by all three utility functions is presented below as:

Hypothesis C. Expression (2.6) and at least one of (2.7), (2.8) and (2.9) are satisfied.

$$q^k \text{ is nonnegative and transient for } k = 1, 2, \ldots, K. \qquad (2.6)$$

$$q^k \text{ is substochastic for } k = 1, 2, \ldots, K. \qquad (2.7)$$

$$r(i) \le 0 \text{ for } i = 1, 2, \ldots, |N|. \qquad (2.8)$$

$$r(i) \ge 0 \text{ for } i = 1, 2, \ldots, |N|. \qquad (2.9)$$

The case in which (2.6) and (2.7) hold is dubbed Hypothesis RN (short for risk neutral). This case includes the classic discounted model, which has transition probability $p(i,j)$, discount factor $c$ that satisfies $0 < c < 1$ and $q(i,j) = c\,p(i,j)$, so that each row of $(q^k)^t$ sums to $c^t$, which guarantees that $q^k$ is transient. Hypothesis RN also encompasses linear-utility models in which ratios akin to (1.1) would have 0's in their denominators. Hypothesis RN is relaxed in Section 10.

The case in which (2.6) and (2.8) hold is dubbed Hypothesis RA (short for risk-averse). In this case, the assumption that $q^k$ is transient excludes a bandit whose repeated play would earn expected utility of $-\infty$. Hypothesis RA is also relaxed in Section 10.

The case in which (2.6) and (2.9) hold is dubbed Hypothesis RS (short for risk-seeking). In it, the assumption that $q^k$ is transient rules out bandits whose repeated play would earn expected utility of $+\infty$.

Hypothesis C supports nearly all of the results in this paper. An exception occurs in Sections 7-8, where Hypothesis RN (and only it) is shown to accommodate constraints that link the bandits.

2.5 Transient matrices 

A central role is played by matrices that are nonnegative and transient. Relevant information about these matrices is contained in Proposition 2.1 below. It employs this nomenclature: vectors $x$ and $y$ that have the same number of entries satisfy $x \gg y$ if and only if $x_j > y_j$ for each $j$.

Proposition 2.1. Let $Q$ be a nonnegative $n \times n$ matrix. The following are equivalent:

(a) The matrix $Q$ is transient.

(b) The matrix $(I - Q)$ is invertible, and $(I - Q)^{-1} = I + Q + Q^2 + \cdots$.

(c) There exists an $n \times 1$ vector $f \gg 0$ such that the equation $(I - Q)x = f$ has a solution $x \gg 0$.

(d) There exists an $n \times 1$ vector $y \gg 0$ such that $y \gg Qy$.

Proof. Demonstration that (a) $\Rightarrow$ (b) $\Rightarrow$ (c) $\Rightarrow$ (d) $\Rightarrow$ (a) is routine and is omitted. ■

Parts of the analysis that follows could be simplified in the linear-utility case because a substochastic matrix $q^k$ is transient if and only if termination occurs with positive probability after at most $|N_k|$ transitions.
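Two of the equivalent conditions in Proposition 2.1 lend themselves to a quick numerical check. The sketch below is not part of the paper's method: `looks_transient` is a heuristic that watches the entries of successive powers of $Q$ (definition (a), with an assumed tolerance and iteration cap), whereas `certifies_transient` verifies a candidate certificate $y$ for condition (d). Both function names are hypothetical.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def looks_transient(Q, tol=1e-12, max_power=1000):
    """Heuristic for definition (a): do the entries of Q^t shrink below tol?"""
    P = [row[:] for row in Q]
    for _ in range(max_power):
        if max(abs(v) for row in P for v in row) < tol:
            return True
        P = mat_mul(P, Q)
    return False


def certifies_transient(Q, y):
    """Condition (d): y >> 0 and y >> Qy certify that a nonnegative Q is transient."""
    Qy = [sum(qij * yj for qij, yj in zip(row, y)) for row in Q]
    return all(yi > 0 and yi > v for yi, v in zip(y, Qy))


Q = [[0.0, 0.9],
     [0.0, 0.5]]
print(looks_transient(Q), certifies_transient(Q, [2.0, 1.0]))
```

A returned `False` from `looks_transient` only means that the powers did not become small within the cap; the certificate test, by contrast, is exact whenever a valid $y$ is supplied.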



2.6 Inheritance 

Hypothesis C is a property of the individual bandits. Its implications for the multi-armed bandit are investigated next. Let us recall that $S$ denotes the set of all multi-states of the multi-armed bandit. Each stationary nonrandomized policy $\pi$ has an $|S| \times |S|$ transition rate matrix $Q^{\pi}$ that is given, for each pair $s$ and $t$ of multi-states, by

$$Q^{\pi}_{st} = \begin{cases} q(s_k, j) & \text{if } k = \pi(s) \text{ and } t = s \setminus k \cup \{j\} \text{ for some } j \in N_k, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.10)$$

Each stationary nonrandomized policy $\pi$ also has an $|S| \times 1$ reward vector $R^{\pi}$ that is defined for each multi-state $s$ in $S$ by

$$R^{\pi}(s) = r(s_{\pi(s)}). \qquad (2.11)$$

Proposition 2.2. Consider any stationary nonrandomized policy $\pi$. Condition (2.6) guarantees that $Q^{\pi}$ is nonnegative and transient.

Proof (adapted from [6]). By hypothesis, each bandit $k$ has a transition rate matrix $q^k$ that is nonnegative and transient. That $Q^{\pi}$ is nonnegative is immediate from (2.10). Part (d) of Proposition 2.1 guarantees that each bandit $k$ has a column vector $x^k \gg 0$ such that $x^k \gg q^k x^k$. Denote as $y$ the $|S| \times 1$ vector whose entry $y_s$ for multi-state $s$ is given by $y_s = x^1_{s_1} x^2_{s_2} \cdots x^K_{s_K}$. It is clear that $y \gg 0$. Consider any multi-state $s$; set $k = \pi(s)$ and set $i = s_k$. The nonzero entries in the $s$-th row of $Q^{\pi}$ correspond to the nonzero entries in the $i$-th row of $q^k$, and the inequality $x^k \gg q^k x^k$ guarantees $y_s > [Q^{\pi} y]_s$. This holds for each multi-state $s$, so part (d) of Proposition 2.1 guarantees that $Q^{\pi}$ is transient. ■

That (2.6) is inherited by the multi-armed bandit is the gist of Proposition 2.2. That (2.7)-(2.9) are inherited is evident from (2.10) and (2.11). Thus, the multi-armed bandit inherits the hypothesis that is satisfied by the individual bandits.

2.7 A sequential decision process 

A well-developed theory of sequential decision processes (cf. Denardo [5] or Veinott [21]) can be applied to the model whose local utility function is given by (2.1) with transition rates that satisfy (2.6). Proposition 2.2 shows that each stationary nonrandomized policy $\pi$ has a transition rate matrix $Q^{\pi}$ that is nonnegative and transient, so Part (b) of Proposition 2.1 shows that $(I - Q^{\pi})$ is invertible. With the $|S| \times 1$ vector $V^{\pi}$ defined by

$$V^{\pi} = (I - Q^{\pi})^{-1} R^{\pi} \qquad \forall\, \pi \in \Delta, \qquad (2.12)$$

Part (b) of Proposition 2.1 also justifies the interpretation of the $s$-th entry in $V^{\pi}$ as the expected utility for starting in multi-state $s$ and using stationary nonrandomized policy $\pi$ until termination occurs. Premultiplying (2.12) by $(I - Q^{\pi})$ produces the familiar policy evaluation equation,

$$V^{\pi} = R^{\pi} + Q^{\pi} V^{\pi}. \qquad (2.13)$$

With the $|S| \times 1$ vector $F$ defined by

$$F(s) = \max\{V^{\delta}(s) : \delta \in \Delta\} \qquad \forall\, s \in S, \qquad (2.14)$$

the number $F(s)$ equals the largest expected utility obtainable from any stationary nonrandomized policy, given starting multi-state $s$. A policy $\pi$ is said to be optimal if $V^{\pi} = F$. The restriction to stationary nonrandomized policies is justified because Hypothesis C has been shown to suffice for such a policy to be optimal over the class of all history-remembering policies; see [5] or [21]. Further, such a policy can be found by linear programming, by policy improvement, or by successive approximation. None of these methods is practical when the number $|S|$ of multi-states is large, however.
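To make the cost of working on the full multi-state space concrete, the following sketch evaluates a fixed policy by successive approximation of the policy evaluation equation (2.13). It is a brute-force baseline under assumed conventions (local states numbered from 0, a multi-state stored as a tuple, and a hypothetical function name `evaluate_policy`), not one of the efficient procedures developed in later sections.

```python
from itertools import product


def evaluate_policy(q, r, policy, sweeps=200):
    """Iterate V <- R + Q V over every multi-state.

    q[k][i][j], r[k][i]: transition rates and rewards of bandit k.
    policy(s): index of the bandit played at multi-state s (a tuple).
    Convergence relies on the transience guaranteed by Proposition 2.2.
    """
    states = list(product(*[range(len(rk)) for rk in r]))
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        new = {}
        for s in states:
            k = policy(s)
            i = s[k]
            total = r[k][i]
            for j, rate in enumerate(q[k][i]):
                t = s[:k] + (j,) + s[k + 1:]   # bandit k moves from i to j
                total += rate * V[t]
            new[s] = total
        V = new
    return V


# Two bandits: one with two states, one with a single state.
q = [[[0.5, 0.25], [1 / 3, 0.0]], [[0.5]]]
r = [[1.0, 2.0], [3.0]]
labels = [[1, 3], [2]]                      # illustrative labels, lowest is played


def lowest_label(s):
    return min(range(len(s)), key=lambda k: labels[k][s[k]])


V = evaluate_policy(q, r, lowest_label)
```

The dictionary `V` has one entry per multi-state, so its size is the product of the sizes of the state sets, which is what makes this approach impractical for many bandits.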

3 Labeling and data revision 

Let us recall that each bandit $k$ has a distinct set $N_k$ of states, that $N$ is the union of the state sets of all bandits, that 0 is a special state that is not in $N$, and that termination is modeled by transition to state 0. A labeling $L$ is the assignment to each $j \in N \cup \{0\}$ of a label $L(j)$ that is an integer between 1 and $|N| + 1$, with $L(0) = |N| + 1$ and with no two states having the same label. Thus, each labeling $L$ assigns a distinct label to each state in $N$, and it assigns the highest label to state 0.

A stationary nonrandomized policy $\pi$ for the multi-armed bandit is called a priority rule if it is determined by a labeling $L$ like so:

$$\pi(s) = \operatorname{argmin}\{L(s_k) : 1 \le k \le K\} \qquad \forall\, s \in S. \qquad (3.1)$$

The priority rule $\pi$ in (3.1) is said to be keyed to the labeling $L$. Given any multi-state $s$, this priority rule plays the bandit $k$ whose current state $s_k$ has the lowest label.
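In code, a priority rule keyed to a labeling reduces to a one-line minimization. The sketch below assumes that bandits and their local states are numbered from 0 and that `labels[k][i]` holds the label of local state `i` of bandit `k`; the function name is hypothetical.

```python
def priority_rule(labels, s):
    """Return the bandit whose current state carries the lowest label, as in (3.1).

    labels[k][i]: label of local state i of bandit k (all labels distinct).
    s[k]: current local state of bandit k.
    """
    return min(range(len(s)), key=lambda k: labels[k][s[k]])


labels = [[1, 3], [2]]
print(priority_rule(labels, (0, 0)), priority_rule(labels, (1, 0)))
```

Because labels are distinct across bandits, the minimizer is unique and no tie-breaking rule is needed.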

3.1 Revised rewards and transition rates 

The notation is now simplified somewhat. For the remainder of this section, bandit $k$ has $n$ states (rather than $|N_k|$ states), and these states are numbered 1 through $n$. This bandit's transition rates form the $n \times n$ matrix $q^k$, and its rewards form the $n \times 1$ vector $r^k$.

Consider the state $i$ in bandit $k$ that has

$$i = \operatorname{argmin}\{L(j) : j \in N_k\}. \qquad (3.2)$$

Suppose a multi-state $s$ is observed that includes state $i \in N_k$ and for which the priority rule $\pi$ has $\pi(s) = k$. The priority rule $\pi$ will continue to call for bandit $k$ to be played until it experiences a transition to a state other than $i$. This motivates the replacement of each transition rate $q(j,p)$ and each reward $r(j)$ in bandit $k$ by $\tilde q(j,p)$ and $\tilde r(j)$, where:

$$\tilde q(i,p) = q(i,p)/[1 - q(i,i)] \quad \text{if } p \ne i, \qquad (3.3)$$

$$\tilde q(j,p) = q(j,p) + q(j,i)\,\tilde q(i,p) \quad \text{if } j \ne i \text{ and } p \ne i, \qquad (3.4)$$

$$\tilde q(j,i) = 0 \quad \text{for each } j, \qquad (3.5)$$

$$\tilde r(i) = r(i)/[1 - q(i,i)], \qquad (3.6)$$

$$\tilde r(j) = r(j) + q(j,i)\,\tilde r(i) \quad \text{if } j \ne i. \qquad (3.7)$$

The selection of $i$ borrows from [19], but that reference does not suggest any scheme to update the data as is done in (3.3)-(3.7).

Repeated play replaces the bandit's transition rate matrix $q^k$ by the matrix $\tilde q^k$ whose entries are given by (3.3)-(3.5), and it replaces the bandit's reward vector $r^k$ by the vector $\tilde r^k$ whose entries are given by (3.6)-(3.7). The revised transition matrix and reward vector are for a model in which transitions to state $i$ do not occur. It will soon be demonstrated that $\tilde q^k$ and $\tilde r^k$ inherit the version of Hypothesis C that is satisfied by $q^k$ and $r^k$.
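A single data revision can be written out directly from the five update formulas. The sketch below is an illustration with states numbered from 0 and exact rational arithmetic; the function name `revise` is an assumption of the sketch.

```python
from fractions import Fraction


def revise(q, r, i):
    """Revise one bandit's data so that transitions into state i no longer occur.

    Row i is rescaled by 1 / (1 - q(i,i)); every other row j absorbs the
    detour j -> i -> p; column i of the revised matrix is left at zero.
    """
    n = len(q)
    d = Fraction(1) - q[i][i]
    qt = [[0] * n for _ in range(n)]
    rt = [0] * n
    rt[i] = r[i] / d
    for p in range(n):
        if p != i:
            qt[i][p] = q[i][p] / d
    for j in range(n):
        if j != i:
            rt[j] = r[j] + q[j][i] * rt[i]
            for p in range(n):
                if p != i:
                    qt[j][p] = q[j][p] + q[j][i] * qt[i][p]
    return qt, rt


q = [[Fraction(1, 2), Fraction(1, 4)],
     [Fraction(1, 3), Fraction(0)]]
r = [Fraction(1), Fraction(2)]
qt, rt = revise(q, r, 0)   # state 0 plays the role of the lowest-labeled state
```

For these data the revised reward of state 0 doubles to 2, since the state is replayed until it is left, and state 1 acquires a self-transition rate of 1/6 that accounts for the round trip through state 0.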

3.2 Elementary row operations 

Equations (3.4), (3.5) and (3.7) describe a model in which the data of bandit $k$ have been revised so that no transitions occur to the state $i$. This process can be iterated. The second execution of (3.2) occurs with state $i$ removed from $N_k$, and it selects the state in $N_k$ whose label is second lowest. And so forth. Algorithmically, the effect of repeated data revision is to begin with the $n \times (n + 1)$ matrix (tableau) $[(I - q^k), r^k]$ and to use elementary row operations to alter the entries in this tableau like so:

Triangularizer (for bandit $k$ in accord with labeling $L$).

1. Begin with the tableau $[(I - q^k), r^k]$. Set $M = N_k$. While $M$ is nonempty, do Steps 2 and 3.

2. Find the state $i \in M$ whose label $L(i)$ is smallest. Set $a = 1/[1 - q^k(i,i)]$.

(a) Replace row $i$ of the tableau $[(I - q^k), r^k]$ by itself times the constant $a$.

(b) For each state $j \in M \setminus \{i\}$, replace row $j$ of this tableau by itself plus the constant $q(j,i)$ times (the updated) row $i$; this update equates $q_{jt}$ to 0 for $t = i$ and for each $t$ in $N_k \setminus M$.

3. Replace $M$ by $M \setminus \{i\}$.

The first execution of Step 2 replaces the tableau $[(I - q^k), r^k]$ by $[(I - \tilde q^k), \tilde r^k]$, where the entries in the $n \times 1$ vector $\tilde r^k$ and in the $n \times n$ matrix $\tilde q^k$ are specified by (3.3)-(3.7) with $i$ as the state whose label is lowest. The second execution of Step 2 replaces $\tilde q^k$ by the transition rate matrix for which transitions to state $i$ are not observed and in which transitions to the state whose label is second lowest are not observed, except for the transition to it from state $i$. And so forth.
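The three steps of the Triangularizer translate almost line for line into code. The following is a sketch under assumed conventions (states numbered from 0, `label[s]` giving the label of local state `s`, exact arithmetic via `Fraction`, and a hypothetical function name); it returns the revised transition rates and rewards read off the final tableau.

```python
from fractions import Fraction


def triangularize(q, r, label):
    """Row-reduce the tableau [(I - q), r] in order of increasing label."""
    n = len(q)
    T = [[Fraction(int(i == j)) - q[i][j] for j in range(n)] + [r[i]]
         for i in range(n)]
    M = set(range(n))
    while M:
        i = min(M, key=lambda s: label[s])      # Step 2: lowest label in M
        a = 1 / T[i][i]                         # a = 1 / [1 - q(i, i)]
        T[i] = [a * v for v in T[i]]            # Substep 2(a)
        for j in M - {i}:                       # Substep 2(b)
            m = -T[j][i]                        # current value of q(j, i)
            T[j] = [vj + m * vi for vj, vi in zip(T[j], T[i])]
        M.remove(i)                             # Step 3
    q_fin = [[Fraction(int(i == j)) - T[i][j] for j in range(n)]
             for i in range(n)]
    r_fin = [T[i][n] for i in range(n)]
    return q_fin, r_fin


q = [[Fraction(1, 2), Fraction(1, 4)],
     [Fraction(1, 3), Fraction(0)]]
r = [Fraction(1), Fraction(2)]
q_fin, r_fin = triangularize(q, r, label=[1, 2])
```

With labels `[1, 2]` the final tableau is upper triangular, and the reward 16/5 obtained for the second state agrees with solving the two-state system $V_1 = 2 + \tfrac13 V_0$, $V_0 = 1 + \tfrac12 V_0 + \tfrac14 V_1$ by hand.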

Proposition 3.1. Suppose that the data for bandit $k$ satisfy Hypothesis RN, RA, or RS. When the data for bandit $k$ are triangularized in accord with a labeling $L$, each iteration of Step 2 produces a tableau $[(I - \tilde q^k), \tilde r^k]$ that satisfies the same hypothesis.

Proof. By hypothesis, $q^k$ is nonnegative and transient. The initial execution of Step 2 of the Triangularizer replaces $[(I - q^k), r^k]$ by $[(I - \tilde q^k), \tilde r^k]$. It does so by multiplying row $i$ by the positive number $a$ and then replacing each row $j$ other than $i$ by itself plus the nonnegative multiple $q(j,i)$ times the updated row $i$. This guarantees $\tilde q^k \ge 0$. It further guarantees that $\tilde r^k \le 0$ if $r^k \le 0$ and that $\tilde r^k \ge 0$ if $r^k \ge 0$. In particular, (2.8) and (2.9) are preserved.

Since $q^k$ is nonnegative and transient, Part (c) of Proposition 2.1 shows that there exists a vector $f \gg 0$ such that the equation $(I - q^k)x = f$ has a solution $x \gg 0$. Let us apply the Triangularizer to the tableau $[(I - q^k), f]$. The initial execution of Step 2 replaces $[(I - q^k), f]$ by $[(I - \tilde q^k), \tilde f]$, and the same row operations keep $\tilde f \gg 0$. Elementary row operations preserve the solutions to equation systems, so the strictly positive vector $x$ satisfies $(I - \tilde q^k)x = \tilde f$. Since $\tilde q^k \ge 0$, part (c) of Proposition 2.1 also shows that $\tilde q^k$ is transient, hence that (2.6) is preserved.

Finally, suppose that $q^k$ satisfies (2.7). With $e$ as the $n \times 1$ vector of 1's, note that $(I - q^k)e = g$ with $g \ge 0$. As noted above, $(I - \tilde q^k)e = \tilde g$ with $\tilde g \ge 0$, which shows that (2.7) is preserved.

It has been demonstrated that Hypotheses RN, RA and RS are preserved by the first execution of Step 2 of the Triangularizer. Iterating this argument completes the proof. ■

The computational effort for executing the Triangularizer is determined in the next result. 



Proposition 3.2. With $n = |N_k|$, executing the Triangularizer on bandit $k$ entails $\tfrac{2}{3}n^3 + \tfrac{1}{2}n^2 + \tfrac{5}{6}n$ arithmetic operations.

Proof. The computation of the $[1 - q^k(i,i)]$'s in Step 1 requires $n$ subtractions. Next consider the execution of Step 2 when $|M| = m$. As the entries in row $i$ indexed by the columns of $N_k \setminus M$ are zero and are not changed in Substep 2(a), and as the updated diagonal entry of row $i$ equals 1, Substep 2(a) requires $m$ divisions (including the update of $r^k(i)$). Also, Substep 2(b) requires $(m - 1)m$ additions and $(m - 1)m$ multiplications. Thus, the total number of arithmetic operations needed to execute both substeps is $m + 2m(m - 1) = 2m^2 - m$. As

$$\sum_{m=1}^{n} m^2 = 2\binom{n+1}{3} + \binom{n+1}{2} = \tfrac{1}{3}n^3 + \tfrac{1}{2}n^2 + \tfrac{1}{6}n \quad\text{and}\quad \sum_{m=1}^{n} m = \frac{n^2 + n}{2},$$

the total number of arithmetic operations needed to execute the Triangularizer is

$$n + 2\left[\tfrac{1}{3}n^3 + \tfrac{1}{2}n^2 + \tfrac{1}{6}n\right] - \frac{n^2 + n}{2} = \tfrac{2}{3}n^3 + \tfrac{1}{2}n^2 + \tfrac{5}{6}n. \qquad ■$$
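The operation count can be checked mechanically. The sketch below takes the per-step count $2m^2 - m$ and the $n$ initial subtractions from the proof as given, sums them, and compares the result with a cubic closed form; both function names are hypothetical.

```python
from fractions import Fraction


def triangularizer_ops(n):
    """n initial subtractions plus 2m^2 - m operations for each m = |M| = 1..n."""
    return n + sum(2 * m * m - m for m in range(1, n + 1))


def closed_form(n):
    """Candidate closed form (2/3) n^3 + (1/2) n^2 + (5/6) n."""
    return Fraction(2, 3) * n**3 + Fraction(1, 2) * n**2 + Fraction(5, 6) * n


agree = all(triangularizer_ops(n) == closed_form(n) for n in range(1, 100))
```

For instance, a bandit with two states needs 9 operations by either formula.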



3.3 Illustration 

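The net effect of the Triangularizer is easiest to visualize when state 1 has the lowest label, state 2 has the next lowest label, and so forth. In this case, the Triangularizer transforms the tableau $[(I - q^k), r^k]$ into the $n \times (n + 1)$ tableau $[(I - \tilde q^k), \tilde r^k]$ whose entries have the format

$$\left[\begin{array}{ccccc|c}
1 & -\tilde q(1,2) & -\tilde q(1,3) & \cdots & -\tilde q(1,n) & \tilde r(1) \\
0 & 1 & -\tilde q(2,3) & \cdots & -\tilde q(2,n) & \tilde r(2) \\
0 & 0 & 1 & \cdots & -\tilde q(3,n) & \tilde r(3) \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & \tilde r(n)
\end{array}\right] \qquad (3.8)$$

with 1's on the principal diagonal and 0's below that diagonal. With finalized data, each transition is to a state having a larger label, and termination is guaranteed to occur within $n = |N_k|$ transitions.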



With linear utility (but not with exponential utility), the finalized data have simple interpretations: Given that bandit $b(i)$ is in state $i$, the number $\tilde r(i)$ equals the expectation of the income that will be earned if bandit $b(i)$ is played until it experiences transition to a state whose label exceeds $L(i)$, and $\tilde q(i,j)$ is the probability that this transition will occur to state $j$.

3.4 Finalized data 

Here and henceforth, tildes are used to identify the rewards and transition rates with which the Triangularizer ends, as in $\tilde r(i)$, $\tilde q(i,j)$, $\tilde r^k$, and $\tilde q^k$, and these data are said to be finalized. The data for state $i$ reach their finalized values when Step 2 is executed for state $i$. In other words, after Step 2 is executed for state $i$, no further changes occur in the $i$-th row or column of the tableau $[(I - q^k), r^k]$.

Let $\pi$ be the priority rule that is keyed to the labeling $L$. Equations (2.10) and (2.11) specify the $|S| \times |S|$ matrix $Q^{\pi}$ and the $|S| \times 1$ vector $R^{\pi}$ in terms of the original data. Their analogs $\tilde Q^{\pi}$ and $\tilde R^{\pi}$ using finalized data are:

$$\tilde Q^{\pi}_{st} = \begin{cases} \tilde q(s_k, j) & \text{if } k = \pi(s) \text{ and } t = s \setminus k \cup \{j\} \text{ for some } j \in N_k, \\ 0 & \text{otherwise,} \end{cases} \qquad (3.9)$$

$$\tilde R^{\pi}(s) = \tilde r(s_{\pi(s)}). \qquad (3.10)$$

It was demonstrated in Section 2.6 that Hypothesis C is inherited by the multi-armed bandit. Hence, with $V^{\pi}(s)$ as the expected utility for starting in multi-state $s$ and using priority rule $\pi$, the vector $V^{\pi}$ is the unique solution to $V^{\pi} = R^{\pi} + Q^{\pi} V^{\pi}$. Proposition 3.1 shows that the model with finalized data also inherits Hypothesis C, hence that its reward vector $\tilde V^{\pi}$ is the unique solution to the policy evaluation equation

$$\tilde V^{\pi} = \tilde R^{\pi} + \tilde Q^{\pi} \tilde V^{\pi}. \qquad (3.11)$$

That finalizing the data preserves expected utility is the gist of:

Proposition 3.3. Suppose Hypothesis C is satisfied. Let $\pi$ be a priority rule that is keyed to a labeling $L$. Then $\tilde V^{\pi} = V^{\pi}$.

Proof. A sequence of elementary row operations akin to those in the Triangularizer transforms the tableau $[(I - Q^{\pi}), R^{\pi}]$ into $[(I - \tilde Q^{\pi}), \tilde R^{\pi}]$. Elementary row operations preserve the set of solutions to an equation system. Hence, since $V^{\pi}$ is the unique solution to $(I - Q^{\pi})V = R^{\pi}$, it is the unique solution to $(I - \tilde Q^{\pi})V = \tilde R^{\pi}$. ■

The Triangularizer first appeared in [6], with an elaborate analysis. An antecedent to it appeared in Kaspi and Mandelbaum [14], and a contemporaneous account can be found in Sonin [18]. That elementary row operations simplify the analysis seems not to have been observed previously, however.

4 Policy evaluation 

Throughout this section, $s$ is any given multi-state, $L$ is any given labeling and $\pi$ is the priority rule that is keyed to $L$. An algorithm that computes the expected utility $V^{\pi}(s)$ will be presented. Proposition 3.3 shows that $\tilde V^{\pi} = V^{\pi}$, for which reason $V^{\pi}(s)$ can be, and will be, computed using finalized data. With finalized data, for (any given) state $i$, bandit $b(i)$ is played at most once while in state $i$, and the finalized reward $\tilde r(i)$ is earned if that event occurs. The expected utility $V^{\pi}(s)$ is then a linear combination of the finalized rewards, say

$$V^{\pi}(s) = \sum_{i \in N} z(i)\, \tilde r(i); \qquad (4.1)$$

in the case of linear utility, $z(i)$ is the probability that bandit $b(i)$ is played when its state is $i$, with finalized data.

A recursion will be used to compute the $z(i)$'s. Each step of this recursion updates entries in a set of vectors, one per bandit. For $p = 1, \ldots, K$, the vector $y^p$ has $|N_p|$ entries, one per state, and is initialized by

$$y^p(j) = \begin{cases} 1 & \text{if } j = s_p, \\ 0 & \text{if } j \in N_p \setminus \{s_p\}. \end{cases} \qquad (4.2)$$

Successively, for $n = 1, 2, \ldots, |N|$, this procedure selects the state $i$ having $L(i) = n$, sets $k = b(i)$, updates $y^k$ by

$$y^k(j) \leftarrow [\,y^k(j) + y^k(i)\,\tilde q(i,j)\,] \quad \text{if } j \in N_k \setminus \{i\}, \qquad (4.3)$$

$$y^k(i) \leftarrow 0, \qquad (4.4)$$

and makes no change in $y^p$ for any $p \ne k$. Equation (4.3) augments the transition rate $y^k(j)$ to state $j$ by the transition rate $y^k(i)\,\tilde q(i,j)$ to state $i$ and then directly to $j$. Equation (4.4) reflects the fact that no state $i$ is revisited when finalized data are employed.

The analysis of this procedure is eased by defining, for $n = 1, 2, \ldots, |N|$,

$$P_n = \{s \in S : n > \min\{L(s_k) : 1 \le k \le K\}\}. \qquad (4.5)$$

Evidently, $P_n$ contains those multi-states that include a state whose label is less than $n$.

Proposition 4.1. Suppose Hypothesis C is satisfied. Interrupt the execution of (4.2)-(4.4) just prior to the iteration in which it selects the state $i$ having $L(i) = n$. At this moment, the quantity $y^p(j)$ equals the aggregate transition rate with finalized data of bandit $p$ from state $s_p$ to state $j$ due to play at each multi-state in $P_n$.

Proof. When $n = 1$, this result corresponds to the initial conditions. Suppose it holds for $n \ge 1$. Expressions (4.3) and (4.4) show that it holds for $n + 1$. ■


Proposition 4.1 prepares for the analysis of the:

Evaluator (for starting multi-state $s$, labeling $L$ and priority rule $\pi$ that is keyed to $L$).

1. For each bandit $k$, define $y^k$ by (4.2). Set $V = 0$ and $n = 1$. While $n \le |N|$, do Steps 2 and 3.

2. Let $i$ be the state whose label $L(i)$ equals $n$, and set $k = b(i)$. Replace $V$ by

$$V + \tilde r(i)\, y^k(i) \prod_{p \ne k} \Big[ \sum_{j \in N_p} y^p(j) \Big]. \qquad (4.6)$$

3. Execute (4.3) and then (4.4) for bandit $k$. Then replace $n$ by $n + 1$.

The next result shows that the Evaluator determines $V^{\pi}(s)$.

Proposition 4.2. Suppose Hypothesis C is satisfied. The Evaluator terminates with $V = V^{\pi}(s)$.

Proof. For $i \in N$, set $n = L(i)$ and $k = b(i)$. The coefficient $z(i)$ in (4.1) equals the aggregate transition rate from multi-state $s$ to the set of multi-states $t$ that have $n = \min\{L(t_p) : 1 \le p \le K\}$. From Proposition 4.1 we obtain $z(i) = y^k(i) \prod_{p \ne k} \big[ \sum_{j \in N_p} y^p(j) \big]$, which completes the proof. ■
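A compact rendering of the Evaluator is sketched below. It assumes that the data passed in are already finalized (so that each transition leads to a state with a larger label), that local states are numbered from 0, and that the product over the other bandits is recomputed from scratch at each step rather than maintained incrementally; the function name and argument layout are assumptions of the sketch.

```python
from fractions import Fraction


def evaluator(q_fin, r_fin, labels, start):
    """Expected utility of the priority rule keyed to `labels`, from `start`.

    q_fin[k][i][j], r_fin[k][i]: finalized transition rates and rewards.
    labels[k][i]: globally distinct label of local state i of bandit k.
    start[k]: initial local state of bandit k.
    """
    K = len(q_fin)
    # y[k][j]: aggregate rate at which bandit k has reached state j so far.
    y = [[1 if j == start[k] else 0 for j in range(len(r_fin[k]))]
         for k in range(K)]
    order = sorted((labels[k][i], k, i)
                   for k in range(K) for i in range(len(labels[k])))
    V = 0
    for _, k, i in order:
        others = 1
        for p in range(K):
            if p != k:
                others *= sum(y[p])          # mass of bandit p not yet used up
        V += r_fin[k][i] * y[k][i] * others
        for j in range(len(y[k])):
            if j != i:
                y[k][j] += y[k][i] * q_fin[k][i][j]
        y[k][i] = 0
    return V


# Finalized data for a two-state bandit and a one-state bandit.
q_fin = [[[0, Fraction(1, 2)], [0, 0]], [[0]]]
r_fin = [[2, Fraction(16, 5)], [6]]
V = evaluator(q_fin, r_fin, labels=[[1, 3], [2]], start=(0, 0))
```

For these data the result is 5, which matches a hand solution of the policy evaluation equation on the two reachable multi-states when the unfinalized data are $q = [[1/2, 1/4], [1/3, 0]]$, $r = (1, 2)$ for the first bandit and $q = [[1/2]]$, $r = (3)$ for the second.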



The computational effort for executing the Evaluator is determined in the next result. 

Proposition 4.3. With n = Ylk=i \^k\, executing the Evaluator entails J2k=i fl-^fcP + 1^ 
operations (beyond the effort required to apply the Triangularizer on each bandit). 



5 arithmetic 



Proof. Augment the evaluator by keeping a record of w^ = J^jpn V^U) foJ" ^^ch p and of w 



CJ, E 



jeNp . 



^ieA^p 



y^j) 



The initial value of each of these expressions is 1. Keeping record of these ex- 
pression will facilitate the computation of the bracketed terms in (5.5) by a single division. 

Consider the implementation of Step 2 when $i \in N_k$ is selected and m is the number of states in $N_k$
whose label is larger than L(i). In this case the execution of (5.5) in Step 2 requires one addition, 2 multiplications
and one division, totalling 4 arithmetic operations. Also, in Step 3, (5.2) has to be implemented only
for the m states in $N_k$ whose label is higher than L(i), requiring m additions and m multiplications, totalling
2m arithmetic operations. Next, $\sum_{j \in N_p} y^p(j)$ has to be updated only for p = k, and this update requires
m - 1 additions. Also, the update of $\prod_{p} \sum_{j \in N_p} y^p(j)$ requires the multiplication of the old value by
the ratio of the new and old values of $\sum_{j \in N_k} y^k(j)$, requiring 2 arithmetic operations. The total number of
arithmetic operations applied to execute Steps 2 and 3 over all states i is then

    $\sum_{k=1}^{K} \sum_{m=1}^{|N_k|-1} (3m + 5) = \sum_{k=1}^{K} \frac{(3|N_k| + 10)(|N_k| - 1)}{2} = \frac{3}{2} \sum_{k=1}^{K} |N_k|^2 + \frac{7}{2} n - 5K$.



To our knowledge, the computation of $V^\pi(s)$ for a particular priority policy $\pi$ and particular starting state
s is new. With a different function (4.2), the Evaluator and its work bound apply to any initial distribution
over the multi-states that is in product form (except that the initial values of the $w^p$'s and w of the proof of
Theorem 5.2 will require n - 1 additional arithmetic operations).
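
The operation count in the proof above can be checked numerically. The following is a minimal sketch, assuming the per-state cost of 3m + 5 operations summed over m = 1, ..., |N_k| - 1 for each bandit; the function names are illustrative and do not come from the paper.

```python
def evaluator_ops_direct(bandit_sizes):
    """Sum 3m + 5 over m = 1, ..., |N_k| - 1 for every bandit k."""
    return sum(3 * m + 5 for size in bandit_sizes for m in range(1, size))


def evaluator_ops_factored(bandit_sizes):
    """Per-bandit closed form (3|N_k| + 10)(|N_k| - 1) / 2, which is always an integer."""
    return sum((3 * size + 10) * (size - 1) // 2 for size in bandit_sizes)


def evaluator_ops_expanded(bandit_sizes):
    """Expanded form (3/2) sum |N_k|^2 + (7/2) n - 5K, with n the total number of states."""
    n = sum(bandit_sizes)
    k = len(bandit_sizes)
    squares = sum(size * size for size in bandit_sizes)
    return (3 * squares + 7 * n - 10 * k) // 2
```

For three bandits with 3, 5 and 2 states, all three functions return the same count, which illustrates that the extra work of the Evaluator grows with the squares of the bandit sizes.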

5 Pairwise comparison and preference

In this section, pairwise comparison will be used to identify a state that is "best" amongst a group of states, 
and the data for that state's bandit will be revised accordingly. The amplification a(i) of state i is now
defined by

    $a(i) = \sum_{j} q(i,j)$.    (5.1)

Under Hypothesis RN, each amplification is 1 or less. In the risk-averse and risk-seeking cases, some states 
can have amplifications that exceed 1, however. 

Playing chain b(i) when its state is i earns reward r(i) and multiplies future rewards by the factor a(i).
State i is now said to be preferable to state j if

    $r(i) + a(i) r(j) > r(j) + a(j) r(i)$.    (5.2)

Suppose that state i is preferable to state j: if a multi-state s is observed that includes states i and j, playing
bandit b(i) first and b(j) second is better than the other way around. The definition of preference is applied
even when states i and j are in the same bandit, however.

It will soon be seen that preference is not transitive, but that it can be refined in a way that is transitive. 
To this end, states will be grouped into "categories." The rule by which a category is assigned to each state 
varies with the hypothesis. 

5.1 Categories under Hypothesis RN 

Under Hypothesis RN, each state j has $a(j) \le 1$, and each state is assigned a category by this rule:

• Category 1 consists of each state j that has a(j) = 1 and $r(j) \ge 0$.

• Category 2 consists of each state j that has a(j) < 1.

• Category 3 consists of each state j that has a(j) = 1 and r(j) < 0.

It is easy to see that each state j in category 1 that has r(j) > 0 is preferable to every state in category
2 and that each state in category 2 is preferable to every state in category 3. But no state in category 1 is
preferable to any state in category 3. For this reason, preference is not transitive. State i is now said to be
weakly preferable to state j if the inequality

    $r(i) + a(i) r(j) \ge r(j) + a(j) r(i)$    (5.3)

holds strictly or if this inequality holds as an equation and the category of i is at least as small as the category
of j. Under Hypothesis RN, each state i is assigned a ratio $\rho(i)$ by the following rule:

    $\rho(i) = \begin{cases}
        +\infty & \text{if state } i \text{ is in category 1,} \\
        r(i)/[1 - a(i)] & \text{if state } i \text{ is in category 2,} \\
        -\infty & \text{if state } i \text{ is in category 3.}
    \end{cases}$    (5.4)

It is easy to check that state i is weakly preferable to state j if and only if $\rho(i) \ge \rho(j)$. Evidently,
weak preference is transitive. A state i that is weakly preferable to all others can be found with |N| - 1
comparisons.
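
The equivalence between weak preference and the ratio test can be illustrated in code. The sketch below assumes Hypothesis RN, represents a state by a hypothetical pair (r, a) of reward and amplification, and places the boundary case a = 1, r = 0 in category 1, which is an assumption about how ties are categorized.

```python
import math


def category_rn(r, a):
    """Category of a state under Hypothesis RN (a <= 1); r = 0 with a = 1 is put in category 1."""
    if a < 1:
        return 2
    return 1 if r >= 0 else 3


def ratio_rn(r, a):
    """Ratio rho of rule (5.4)."""
    cat = category_rn(r, a)
    if cat == 1:
        return math.inf
    if cat == 3:
        return -math.inf
    return r / (1 - a)


def weakly_preferable(state_i, state_j):
    """Direct test of (5.3), with ties broken by category."""
    (ri, ai), (rj, aj) = state_i, state_j
    lhs = ri + ai * rj
    rhs = rj + aj * ri
    if lhs != rhs:
        return lhs > rhs
    return category_rn(ri, ai) <= category_rn(rj, aj)


def best_state(states):
    """A state that is weakly preferable to all others, found by comparing ratios."""
    return max(states, key=lambda s: ratio_rn(*s))
```

With states (2, 1), (3, 0.5), (1, 0.5) and (-1, 1), the direct test and the comparison of ratios agree on every ordered pair, and `best_state` returns the category-1 state (2, 1).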

5.2 Categories under Hypothesis RA

Under Hypothesis RA, a state j can have a(j) > 1, but each state j has $r(j) \le 0$, and the states group
themselves into categories like so:

• Category 1 consists of each state j that has r(j) = 0 and a(j) < 1.

• Category 2 consists of each state j that has r(j) < 0.

• Category 3 consists of each state j that has r(j) = 0 and a(j) > 1.

As before, state i is said to be weakly preferable to state j if (5.3) holds strictly or if (5.3) holds as an
equation and the category of state i is at least as small as the category of state j. Under Hypothesis RA, each
state i is assigned a ratio $\rho(i)$ by this rule:

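    $\rho(i) = \begin{cases}
        +\infty & \text{if state } i \text{ is in category 1,} \\
        [a(i) - 1]/r(i) & \text{if state } i \text{ is in category 2,} \\
        -\infty & \text{if state } i \text{ is in category 3.}
    \end{cases}$    (5.5)

It is easy to check that i is weakly preferable to state j if and only if $\rho(i) \ge \rho(j)$: for two states in
category 2, dividing (5.3) by the positive product r(i) r(j) gives $[a(i) - 1]/r(i) \ge [a(j) - 1]/r(j)$.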

5.3 Categories under Hypothesis RS 

In the risk-seeking case, each state j has $r(j) \ge 0$, and the states group themselves into categories by this
rule:

• Category 1 consists of each state j that has r(j) = 0 and a(j) > 1.

• Category 2 consists of each state j that has r(j) > 0.

• Category 3 consists of each state j that has r(j) = 0 and a(j) < 1.

State i's ratio is now defined by:

    $\rho(i) = \begin{cases}
        +\infty & \text{if state } i \text{ is in category 1,} \\
        [a(i) - 1]/r(i) & \text{if state } i \text{ is in category 2,} \\
        -\infty & \text{if state } i \text{ is in category 3.}
    \end{cases}$    (5.6)

With this categorization, the definition of weak preference does not change. Again, state i is weakly
preferred to state j if and only if $\rho(i) \ge \rho(j)$.

5.4 Finding a weakly preferred state in a set 

The characterization of "weakly preferred" under RN, RA and RS by comparing $\rho(\cdot)$ shows that the relation
is transitive. Further, if the r(i)'s and the [1 - a(i)]'s for each state i in a set U are available, then (5.4),
(5.5) or (5.6), respectively, facilitate the identification of a weakly preferred state in U by applying at most
|U| divisions and |U| comparisons.

5.5 A key result

Proposition 5.1 (below) would seem to have a simple proof, at least in the case of linear utility, but we are
not aware of one. The interested reader is referred to the proof of Theorem 5.2 in [6], which employs a
delicate interchange argument.

Proposition 5.1. Suppose Hypothesis C is satisfied, and consider any state i that is weakly preferred to all
others. It is optimal to play bandit b(i) for every multi-state s that includes state i.

5.6 Nomenclature 

In the discussion to follow, the data for bandits 1 through K will be triangularized in parallel, rather than 
one after the other. At any stage in that computation: 

• r(j) and q(j,p) denote the current values of the data for state j,

• $\tilde{r}(j)$ and $\tilde{q}(j,p)$ denote values of the data after they have been updated by the next execution of Step
2 of the Triangularizer,

• $\bar{r}(j)$ and $\bar{q}(j,p)$ denote the finalized values of the data.

The amplification for state j is denoted a(j), $\tilde{a}(j)$ and $\bar{a}(j)$ when it is given in terms of current, updated and
finalized data, respectively. The same is true of the ratio, $\rho(j)$.

It is recalled that the data for state i attain their finalized values when Step 2 is executed for state i.
Proposition 5.2 (below) indicates how each execution of Step 2 of the Triangularizer affects the ratios.

Proposition 5.2. Suppose Hypothesis C is satisfied. With M as any nonempty subset of $N_k$, suppose that
state i in bandit k is weakly preferable to the other states in M with current values of the data for bandit k.
Executing Step 2 of the Triangularizer for state i has these effects:

    $\bar{\rho}(i) = \rho(i)$,    (5.7)

    $\rho(i) \ge \tilde{\rho}(j) \ge \rho(j)$  for all $j \in M \setminus \{i\}$.    (5.8)



Equation (5.7) states that finalizing the data for state i preserves its ratio. Expression (5.8) states that
updating the ratio for a state j other than i can improve its ratio, but not above that for state i. These
observations are insightful, but Proposition 5.2 is not used in this paper, and its proof is omitted. Proposition

facilitates the use of finalized data for each bandit, thereby enabling parallel computation. 



6 Optimization 

A labeling L is said to be optimal if the priority rule $\pi$ that is keyed to L has $V^\pi = F$, i.e., $\pi$ maximizes the
expected utility that can be obtained from each starting multi-state. Proposition 5.1 lays the groundwork for
a variety of algorithms that identify an optimal labeling. The Optimizer, which appears below, triangularizes
the bandits contemporaneously, rather than one after the other. Its first execution of Step 3 identifies the state
i that is weakly preferable to all others with respect to the original data. Its first execution of Steps 3(a) and
3(b) updates the data for bandit b(i) in accord with repeated play while in state i and then removes state i.
The Optimizer then repeats Step 3 with updated data. This recursion stops as soon as all states in one bandit
have been removed.

Optimizer 

1. Begin with C equal to the empty set. For each bandit k, insert in C a state $i \in N_k$ that is weakly preferable
to every other state $j \in N_k$ with respect to the original data. Set n = 1. For each bandit k, set
$M_k = N_k$.

2. Do Step 3 while $M_k$ is nonempty for each k.

3. Find a state $i \in C$ that is weakly preferable to all other states in C with respect to current data. Set
k = b(i) and set L(i) = n. Then replace n by n + 1.

(a) Use Step 2 of the Triangularizer to finalize the data for state i and to update the data for each
state $j \in M_k \setminus \{i\}$.

(b) Remove state i from C. Remove state i from $M_k$. If $M_k$ is nonempty, insert in C a state $j \in M_k$
that is weakly preferable to all other states in $M_k$ with respect to updated data.

The Optimizer stops as soon as all of the states in any bandit have been labeled, with n - 1 as the highest
of the labels. The unlabeled states can be assigned the labels n through |N| in any way. It will not matter:
no bandit whose state is labeled n or higher will ever be played because it cannot have the lowest label.

The Optimizer applies the Triangularizer with respect to a labeling that is determined on line. At each
stage, the state that gets the next label is selected so that it is weakly preferred to all states that have not yet
been labeled, i.e., the states in $\bigcup_{k=1}^{K} M_k$.
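
The steps of the Optimizer can be sketched in code for the linear-utility case. This is a minimal sketch under Hypothesis RN, not the paper's implementation: Step 2 of the Triangularizer is not reproduced in this section, so the function `eliminate` below uses an assumed form of it (state elimination that accounts for repeated play in state i), and the data layout (one pair of a reward list and a transition-rate matrix per bandit) and all names are illustrative.

```python
import math


def ratio(r, a, tol=1e-12):
    """Ratio (5.4) under Hypothesis RN; a = 1 with r >= 0 is treated as category 1."""
    if a < 1 - tol:
        return r / (1 - a)
    return math.inf if r >= 0 else -math.inf


def eliminate(r, q, i, remaining):
    """ASSUMED form of Triangularizer Step 2: finalize row i, then zero column i."""
    d = 1.0 - q[i][i]                      # assumes q[i][i] < 1
    r[i] /= d                              # reward of repeated play while in state i
    for p in remaining:
        q[i][p] = 0.0 if p == i else q[i][p] / d
    for j in remaining:
        if j == i:
            continue
        w = q[j][i]
        r[j] += w * r[i]
        for p in remaining:
            q[j][p] = 0.0 if p == i else q[j][p] + w * q[i][p]


def optimizer(bandits):
    """bandits: list of (rewards, rate_matrix). Returns {(bandit, state): label}."""
    data = [(list(r), [list(row) for row in q]) for r, q in bandits]
    remaining = [set(range(len(r))) for r, _ in data]         # the sets M_k

    def rho(k, j):
        r, q = data[k]
        return ratio(r[j], sum(q[j]))                         # a(j) is the row sum of q

    def best_in(k):
        return max(remaining[k], key=lambda j: rho(k, j))

    candidates = {k: best_in(k) for k in range(len(data))}    # the set C (Step 1)
    labels, n = {}, 1
    while all(remaining):                                     # Step 2
        k = max(candidates, key=lambda b: rho(b, candidates[b]))   # Step 3
        i = candidates[k]
        labels[(k, i)] = n
        n += 1
        eliminate(*data[k], i, remaining[k])                  # Step 3(a)
        remaining[k].discard(i)                               # Step 3(b)
        if remaining[k]:
            candidates[k] = best_in(k)
    for k, rest in enumerate(remaining):                      # leftover labels, any order
        for j in sorted(rest):
            labels[(k, j)] = n
            n += 1
    return labels
```

As a usage example, take one bandit with two states, where state 0 earns nothing and moves to state 1 with rate 0.9 and state 1 earns 10 and terminates, and a second bandit whose single state earns 5 and terminates. The sketch labels state 1 of the first bandit first; after the update, state 0 of that bandit has reward 9, which exceeds 5, so it receives label 2 and the second bandit's state is left for last.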

Proposition 6.1. Suppose Hypothesis C is satisfied. The Optimizer constructs a labeling L that is optimal. 

Proof. Let i be the state selected at the initial execution of Step 3. Weak preference is transitive, so
Proposition 5.1 shows that it is optimal to play bandit b(i) at each multi-state that includes state i. Setting
L(i) = 1 is optimal.

Step 3(a) equates to 0 the transition probability q(j,i) for each state j in b(i), and, for each state j in
bandit b(i), it updates the reward r(j) and the transition probabilities q(j,p) for each $p \ne i$ to account for
repeated play while in state i.

Step 3(b) removes state i from bandit b(i). What remains is a multi-armed bandit with one fewer state.
Proposition 3.1 implies that the same version of Hypothesis C is satisfied by the bandit with one fewer state.
Since weak preference is transitive, the state i that is selected at the second iteration of Step 3 is weakly
preferable to all others in the model with revised data and one fewer state. Proposition 5.1 can be applied a
second time, and state i can be assigned the label L(i) = 2. Iterating this argument completes the proof. ■

The computational effort for executing the Optimizer is determined in the next result. 

Proposition 6.2. With $n = \sum_{k=1}^{K} |N_k|$, the Optimizer can be executed with $\frac{1}{2} \sum_{k=1}^{K} |N_k|^2 + \frac{n}{2}$ arithmetic
operations and $\frac{1}{2} \sum_{k=1}^{K} |N_k|^2 + n(K - 1) + \binom{K}{2}$ comparisons (plus the effort required to execute the
Triangularizer on each bandit).

Proof. Augment the Optimizer by recording a ranking of the elements of C in decreasing weakly preferable
order and the corresponding ratios of those states in C that are in category 2. The initial ranking can be
accomplished with $\binom{K}{2}$ comparisons, whereas the ratios of the states in category 2 that enter C in Step 1 are
computed when those states are selected to enter C.

When state $i \in N_k$ gets a label, $M_k$ changes and the Triangularizer updates the data of its states,
including the r(j)'s and [1 - a(j)]'s. At each stage, finding a weakly preferred state in $M_k$ can be
accomplished with at most $|M_k|$ divisions (determining ratios for states in category 2) and at most $|M_k| - 1$
comparisons. Updating the ranked list C, in which the old state from b(i) is replaced by the newly selected
state, requires at most K - 1 comparisons. So, the effort for executing the Optimizer, beyond the effort
required to execute the Triangularizer on each bandit, is bounded by
$\sum_{k=1}^{K} \sum_{m=1}^{|N_k|} m = \frac{1}{2} \sum_{k=1}^{K} |N_k|^2 + \frac{n}{2}$ arithmetic operations and
$\sum_{k=1}^{K} \sum_{m=1}^{|N_k|} (K - 1 + m - 1) + \binom{K}{2} \le \frac{1}{2} \sum_{k=1}^{K} |N_k|^2 + n(K - 1) + \binom{K}{2}$ comparisons.

Propositions 6.1, [377], 4.3 and 6.2 show that an optimal priority rule and its expected utility F(s) for a
particular starting state s can be computed with $\frac{2}{3} \sum_k |N_k|^3 + O(N^2) = O(N^3)$ arithmetic operations and
$O(N^2)$ comparisons. These last two bounds match the best existing bound for computing Gittins indices
(obtained in [16]; see [13, p. 43]).

This section is closed with the mention of an alternative to the Optimizer. This alternative has two
steps: First, optimize within each individual bandit. Second, use finalized data for each bandit and pairwise
comparison to rank the states 1 through |N| by weak preference. Proposition 5.2 shows that the priority
rule that is keyed to this ranking (labeling) is optimal. This procedure also requires work proportional to
7 Optimization with Constraints 

For the case of a linear utility function that satisfies Hypothesis RN, the multi-armed bandit is now generalized
to include a finite number W of constraints, each on a particular type of reward. Including the objective,
there are now W + 1 types of reward, which are numbered 0 through W. The objective measures type-0
reward and the w-th constraint places a lower bound $C_w$ on the expected type-w reward.

The initial multi-state s is given, and the object is to maximize the expectation of the type-0 reward
subject to constraints that, for each w, keep the expectation of the type-w reward at least as large as $C_w$.
The main thrust of this section is to use column generation to construct an optimal solution to the constrained
problem that is an initial randomization over W + 1 priority rules. At the end of the section, the approach
taken here is compared with a more classic one.

It is known (cf. Feinberg and Rothblum [9]) that an optimal policy can be found among the initial
randomizations over stationary deterministic policies. This lets the multi-armed bandit problem with constraints
be formulated as:

Program 1. Maximize $\sum_\delta \alpha^\delta V_0^\delta(s)$, subject to the constraints

    $\sum_\delta \alpha^\delta = 1$,
    $\sum_\delta \alpha^\delta V_w^\delta(s) \ge C_w$  for w = 1, 2, ..., W,
    $\alpha^\delta \ge 0$  for all $\delta$,

where it is understood that the sum is taken over all stationary deterministic policies $\delta$ and where $V_w^\delta(s)$
denotes the expectation of the type-w utility that is earned if one starts at multi-state s and uses policy $\delta$.
Program 1 has only W + 1 constraints, but it can have a gigantic number of decision variables (one for each
stationary deterministic policy $\delta$ and one for each slack variable), and its data include the type-w reward
$V_w^\delta(s)$ for each w and each policy $\delta$.

Program 2, below, is in the same format as Program 1. Program 2 has one decision variable for each
priority rule $\pi$, rather than for each policy $\delta$.

Program 2. Maximize $\sum_\pi \alpha^\pi V_0^\pi(s)$, subject to the constraints

    $y_0$:   $\sum_\pi \alpha^\pi = 1$,
    $-y_w$:  $\sum_\pi \alpha^\pi V_w^\pi(s) \ge C_w$  for w = 1, ..., W,
    $\alpha^\pi \ge 0$  for all $\pi$.

There are fewer priority rules than policies, but the number of priority rules can still be enormous.
Multipliers have been assigned to the constraints of Program 2. These multipliers will be used in column
generation.

7.1 Preview 

Although Program 2 has fewer columns than does Program 1, computing the data it requires would still be
onerous. Much of this computation can be avoided by coupling the simplex method with column generation.
To indicate how, we suppose that a feasible basis for Program 2 has been found. This feasible basis consists
of W + 1 columns (the constraint matrix has full rank). It prescribes values of the basic variables and of the
multipliers $y_0$ and $-y_1, \ldots, -y_W$. These multipliers are used to define rewards in an unconstrained bandit
problem whose optimal solution (found by the Optimizer) identifies a priority rule $\lambda$ whose corresponding
column has reduced cost (marginal profit) $c^\lambda$ that is the largest. If $c^\lambda$ equals zero, the current basis is
optimal. Alternatively, if $c^\lambda$ is positive, the Evaluator is used to compute the coefficients $V_0^\lambda, \ldots, V_W^\lambda$. A
simplex pivot is then executed, and the process is repeated.

7.2 Feasibility 

Each column of Program 2 is a column of Program 1. Thus, if Program 2 is feasible, Program 1 must also
be feasible. The converse is established in:

Proposition 7.1. Suppose Hypothesis RN is satisfied. If Program 1 is feasible, Program 2 is also feasible. 

Proof. We will prove the contrapositive. Suppose that Program 2 is not feasible. An application of Farkas'
lemma (equivalently of the duality theorem of linear programming) shows that there exist numbers $y_0$ and
$y_1$ through $y_W$ such that

    $y_0 - \sum_{w=1}^{W} y_w V_w^\pi(s) \ge 0$  for all $\pi$,    (7.1)

    $y_w \ge 0$  for w = 1, ..., W,    (7.2)

    $y_0 - \sum_{w=1}^{W} y_w C_w < 0$.    (7.3)

The numbers $y_1$ through $y_W$ will be used as weights for the rewards $r_1(i)$ through $r_W(i)$. Consider
an unconstrained multi-armed bandit in which the reward R(i) for playing bandit b(i) while its state is i is
given by $R(i) = y_1 r_1(i) + \cdots + y_W r_W(i)$. Expression (7.1) states that with reward R(i) for each state i,
no priority rule $\pi$ has aggregate reward that exceeds $y_0$. Proposition 5.1 shows that a priority rule is optimal.
Thus,

    $y_0 - \sum_{w=1}^{W} y_w V_w^\delta(s) \ge 0$  for all $\delta$,    (7.4)

where $\delta$ ranges over all stationary deterministic policies. A solution exists to (7.2)-(7.4), so a second
application of Farkas' lemma shows that no solution exists to the constraints of Program 1. ■
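
The last step of the proof can be spelled out. The following derivation is a sketch that assumes Program 1 includes the normalization $\sum_\delta \alpha^\delta = 1$ and that the multipliers $y_1, \ldots, y_W$ are nonnegative; under those assumptions, a feasible $\alpha$ for Program 1 would contradict the strict inequality satisfied by the multipliers.

```latex
\begin{align*}
0 &\le \sum_{\delta} \alpha^{\delta} \Bigl( y_0 - \sum_{w=1}^{W} y_w V_w^{\delta}(s) \Bigr)
      && \text{(each bracket is nonnegative and } \alpha^{\delta} \ge 0\text{)} \\
  &= y_0 - \sum_{w=1}^{W} y_w \sum_{\delta} \alpha^{\delta} V_w^{\delta}(s)
      && \text{(since } \textstyle\sum_{\delta} \alpha^{\delta} = 1\text{)} \\
  &\le y_0 - \sum_{w=1}^{W} y_w C_w
      && \text{(feasibility of } \alpha \text{ and } y_w \ge 0\text{)} \\
  &< 0 .
\end{align*}
```

The chain yields 0 < 0, so no feasible $\alpha$ can exist.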



Thus, Program 1 is feasible if and only if Program 2 is feasible. Phase I of the simplex method will be 
soon used to determine whether Program 2 is feasible and, if so, to construct a feasible basis with which to 
initiate Phase II of the simplex method. For the moment, it is assumed that a feasible basis for Program 2 
has been found. 

7.3 Phase II 

The constraint matrix for Program 2 includes a column for each of the W slack variables. These columns
are linearly independent of each other, and they are linearly independent of the other columns. Thus, the
rank of its constraint matrix equals the number W + 1 of its rows, and each basis for Program 2 consists of
exactly W + 1 columns. Let us consider an iteration of Phase II. At hand at the start of this iteration is a
feasible basis, its basic solution and its multipliers. This information includes:

• The data (column) for each of the W + 1 basic variables.

• The basic solution (the $\alpha^\pi$'s) for this basis.

• The multipliers $y_0$ and $y_1$ through $y_W$ for this basis.

The multipliers $y_1$ through $y_W$ are nonnegative, and each priority rule $\lambda$ has reduced cost $c^\lambda$ that is given by

    $c^\lambda = V_0^\lambda(s) + \sum_{w=1}^{W} y_w V_w^\lambda(s) - y_0$.
•w=l 

Computation of the reduced cost of each nonbasic priority rule $\lambda$ would be an onerous task, but it is not
necessary. To determine whether or not the current basis is optimal and, if not, to find a priority rule that has the
largest (most positive) reduced cost, one can solve the unconstrained multi-armed bandit problem with the
reward R(i) for playing bandit b(i) when its state is i given by

    $R(i) = r_0(i) + \sum_{w=1}^{W} y_w r_w(i)$.    (7.5)

With these rewards, the Optimizer in Section 6 computes a priority rule $\pi$ that is optimal. Also, the
Evaluator in Section 4 computes the expected return $V^\pi(s)$ for starting in multi-state s and using this priority
rule. If $V^\pi(s) \le y_0$, no nonbasic variable has a positive reduced cost, so the current basis is optimal.
If $V^\pi(s) > y_0$, the column for priority rule $\pi$ enters the basis. To compute $V_w^\pi(s)$ for each w, use the
Triangularizer and Evaluator for priority rule $\pi$. In this computation, the finalized rewards vary with w but
the $y^k(i)$'s do not. To complete an iteration of Phase II, execute a feasible pivot with $\alpha^\pi$ as the entering
variable.
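
The pricing step of an iteration can be sketched as follows. This is a minimal illustration of formula (7.5) and of the optimality test; the layout of `rewards` (one list per reward type, indexed by state) and the function names are assumptions made for the example, and the value of the priced bandit is taken as an input rather than computed.

```python
def priced_rewards(rewards, multipliers):
    """Combine reward types per (7.5): R(i) = r_0(i) + sum over w of y_w * r_w(i).

    rewards[w][i] is the type-w reward of state i; multipliers is [y_1, ..., y_W].
    """
    base, constrained = rewards[0], rewards[1:]
    return [
        base[i] + sum(y * r_w[i] for y, r_w in zip(multipliers, constrained))
        for i in range(len(base))
    ]


def current_basis_is_optimal(priced_value, y0, tol=1e-9):
    """The basis is optimal when the best priced value does not exceed y_0."""
    return priced_value <= y0 + tol
```

For three states with type-0, type-1 and type-2 rewards (1, 0, 0), (0, 1, 0) and (0, 0, 1) and multipliers y_1 = 2, y_2 = 3, the priced rewards are 1, 2 and 3. If the value of the priced bandit exceeds y_0, the corresponding priority rule enters the basis.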

7.4 Phase II recap 

Each feasible basis and its basic solution prescribe an initial randomization (with weight $\alpha^\pi$ assigned to
priority rule $\pi$) over (W + 1 - p) priority rules, where p equals the number of slack variables that are basic.

The multipliers for the current basis determine the data of an unconstrained bandit problem, and the
procedure in prior sections computes its optimal priority rule $\pi$ and its expected return, $V^\pi(s)$. If $V^\pi(s)$
does not exceed $y_0$, the current basis is optimal. If $V^\pi(s)$ exceeds $y_0$, the Evaluator is used to compute the
coefficients $V_0^\pi$ through $V_W^\pi$ of the entering variable. A simplex pivot is then executed.

The pivot itself requires work proportional to $(W + 1)^2$. Identifying the entering variable and its column
of coefficients entails work proportional to $(W + 2) [\sum_k |N_k|^3]$. Only a few iterations may be needed to find
a good basis, or an optimal basis, but that is not guaranteed.

7.5 Phase I 

It remains to determine whether or not Program 2 is feasible and, if it is feasible, to construct a feasible 
basis with which to initiate Phase II. These tasks will be accomplished by "bringing in" the constraints of 
Program 2, one at a time. Starting with n = 1, the n-th iteration of Phase I is initialized with a randomization
over n - 1 priority rules that satisfy the first n - 1 constraints. The n-th iteration maximizes type-n reward,
using the Phase II column generation scheme described above. If the type-n income can be made as large
as $C_n$, a basis has been found with which to initiate the (n + 1)-st iteration. If not, no feasible solution exists
to Program 2.

7.6 The classic formulation 

An optimal policy for a discounted Markov decision problem with W constraints can be found among the
stationary randomized policies (cf. Altman [1, page 102]). This can be accomplished by a linear program
whose constraint matrix has one column per state-action pair, one row per state, and one row per constraint.
The multi-armed bandit has $J = \prod_{k=1}^{K} |N_k|$ multi-states and K actions per multi-state. Its constraint matrix
has W + J rows and K × J columns. The classic formulation has fewer columns than does Program 2, but
it has many more rows.
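
To give a sense of scale, the dimensions of the classic constraint matrix can be computed directly from the bandit sizes. The function below is an illustrative sketch; its name and arguments are not from the paper.

```python
from math import prod


def classic_lp_shape(bandit_sizes, num_constraints):
    """Return (rows, columns) of the classic constraint matrix: W + J rows, K * J columns."""
    num_multistates = prod(bandit_sizes)       # J, the product of the |N_k|
    num_bandits = len(bandit_sizes)            # K actions per multi-state
    return num_constraints + num_multistates, num_bandits * num_multistates
```

For example, ten bandits with ten states each and W = 2 constraints give 10^10 + 2 rows, whereas Program 2 has only W + 1 = 3 rows.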

The classic formulation can also be attacked by column generation, but doing so would be unattractive
because the formulation would have more columns and many more rows than the one we propose.

7.7 A roadblock 

With a linear utility function, multiple types of reward can be handled by column generation and by the
classic method. Both methods utilize the fact that each transition rate q(i,j) is independent of the reward
type.

Let us consider what occurs when multiple types of rewards are introduced in the model with exponential
utility. Note from (2.4) and (2.5) that the payoff x(i,j) appears in the formula for the transition rate q(i,j).
Having multiple types of income causes the transition rate q(i,j) to vary with the reward type. Consequently,
our column generation method (and the classical one) can be applied only when the type-w payoff $x_w(i,j)$
is independent of w; for instance, this is the case when income is earned only at termination.

8 Structural properties 

In the prior section, it was demonstrated that an optimal policy for a constrained bandit problem can be
found among the initial randomizations over W + 1 priority rules. In the current section, the structure of this
optimal policy is probed.

A transient Markov decision problem (MDP) with W constraints has an optimal solution that is an initial
randomization over W + 1 deterministic policies $\delta^1$ through $\delta^{W+1}$, each of which differs from the next at
precisely one state of the MDP; see Feinberg and Rothblum [9]. When this MDP is a multi-armed bandit,
these deterministic policies need not be priority rules, however.

Two labelings are now said to be adjacent if they are identical except that they exchange the states
having labels k and k + 1 for exactly one value of k. The aforementioned property raises the question: Does
Program 2 have an optimal solution that is an initial randomization over priority rules that are keyed to a
sequence of W + 1 labelings with the property that each labeling is adjacent to the next? This question will
be answered in the affirmative in the case of one constraint and in the negative in the case of more than one
constraint.
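
The definition of adjacency translates into a short test. In this sketch a labeling is represented as a dictionary from state names to labels, which is an illustrative choice rather than notation from the paper.

```python
def are_adjacent(labeling_1, labeling_2):
    """True if the labelings agree except that two states exchange labels k and k + 1."""
    if labeling_1.keys() != labeling_2.keys():
        return False
    changed = [s for s in labeling_1 if labeling_1[s] != labeling_2[s]]
    if len(changed) != 2:
        return False
    s, t = changed
    swapped = labeling_1[s] == labeling_2[t] and labeling_1[t] == labeling_2[s]
    return swapped and abs(labeling_1[s] - labeling_1[t]) == 1
```

For instance, the labelings {a: 1, b: 2, c: 3} and {a: 2, b: 1, c: 3} are adjacent, while {a: 1, b: 2, c: 3} and {a: 3, b: 2, c: 1} are not, because the exchanged labels 1 and 3 are not consecutive.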

8.1 Adjacency with one constraint 

Let us consider a multi-armed bandit with one constraint. We have seen that an optimal basis for Program 2
prescribes a randomization over at most two priority rules. If the basic solution for this basis sets $\alpha_j = 1$ for
any j, only one priority rule is used, and adjacency is trivial.

Let us denote by $V_0^p$ and $V_1^p$ the type-0 and type-1 utility for column p. The case that requires analysis
is that in which the optimal basis for Program 2 consists of columns j and k whose priority rules are keyed
to different labelings. For this to occur, the slack variable for the inequality constraint in Program 2 must
not be basic, so this optimal basis assigns the columns j and k nonnegative values $\alpha_j$ and $\alpha_k$ that satisfy

    $\alpha_j V_1^j + \alpha_k V_1^k = C_1$  and  $\alpha_j + \alpha_k = 1$.    (8.1)



The optimal basis for Program 2 assigns to its constraints values of the multipliers $y_0$ and $-y_1$ for which
columns j and k have 0 as their reduced costs. In other words,

    $0 = V_0^j + y_1 V_1^j - y_0$  and  $0 = V_0^k + y_1 V_1^k - y_0$.    (8.2)



If $V_1^j = V_1^k$, equation (8.1) guarantees that both columns have $C_1$ as their type-1 utility, and equation (8.2)
shows that both columns have the same type-0 utility, in which case it is optimal to play either column with
probability 1, and a deterministic priority rule is optimal.

It remains to analyze the case in which $C_1$ lies strictly between $V_1^j$ and $V_1^k$. The labelings to which
columns j and k are keyed need not be adjacent, but columns j and k can be used to construct an optimal 
basis with labelings that are adjacent. To indicate how, we turn to the example in Table 1. In this example, 
columns j and k assign identical labels to the states, except for the sets {6, 7, 8, 9} and {13, 14} of labels. 

Table 1. An optimal basis.

    label      ...  6  7  8  9  ...  13  14
    column j   ...  a  b  c  d  ...  f   g
    column k   ...  d  c  b  a  ...  g   f
column k ... d c b a ... g 



Optimal solutions to the unconstrained multi-armed bandit having $R(i) = r_0(i) + y_1 r_1(i)$ for each state
i are in product form. As a consequence, every column p whose labeling permutes the labels assigned to the
sets {a, b, c, d} and {f, g} of states has 0 as its reduced cost in Program 2. A total of 7 = 1 + 3 + 2 + 1
interchanges of states whose labels are adjacent converts the permutation for column k into the permutation
for column j. One of these interchanges must move the type-1 reward from the side of $C_1$ on which $V_1^k$ lies
to the side on which $V_1^j$ lies, and that switch identifies a pair of priority rules that are keyed to adjacent
labelings and whose columns form an optimal basis. The pattern exhibited by this example holds in general.
The Triangularizer and Evaluator can be used to compute the reward vector for each labeling.
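
The count of seven interchanges can be reproduced by sorting. The sketch below assumes, as in Table 1, that column k assigns the states d, c, b, a to labels 6 through 9 and g, f to labels 13 and 14, while column j assigns a, b, c, d and f, g; it uses a bubble sort, which only ever exchanges states with adjacent labels. The function name is illustrative.

```python
def adjacent_interchanges(start, target):
    """Positions p at which entries p and p + 1 are exchanged to turn start into target."""
    rank = {state: idx for idx, state in enumerate(target)}
    order = list(start)
    swaps = []
    changed = True
    while changed:
        changed = False
        for p in range(len(order) - 1):
            if rank[order[p]] > rank[order[p + 1]]:
                order[p], order[p + 1] = order[p + 1], order[p]
                swaps.append(p)
                changed = True
    return swaps


block_one = adjacent_interchanges("dcba", "abcd")   # labels 6 through 9
block_two = adjacent_interchanges("gf", "fg")       # labels 13 and 14
total = len(block_one) + len(block_two)
```

Each interchange produces a labeling adjacent to the previous one, so evaluating the type-1 reward of each intermediate labeling locates the pair of adjacent labelings on either side of $C_1$.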

8.2 Non-adjacency with two constraints 

For a multi-armed bandit problem with two constraints, an initial randomization over 3 priority rules has
been shown to be optimal. Examples exist in which no optimal solution is an initial randomization over
priority rules that are keyed to a sequence of three adjacent labelings. Such an example is now presented.
This example has 3 chains (bandits), each of which consists of a single state. The three bandits' states are
a, b and c, respectively. The multi-state (a, b, c) is observed initially. Playing any bandit causes immediate
termination. Playing the bandit whose state is a earns the reward vector (1, 0, 0) whose entries are, respectively,
the type-0, type-1 and type-2 reward. Similarly, playing the bandit whose state is b earns reward
vector (0, 1, 0), and playing the bandit whose state is c earns reward vector (0, 0, 1). The lower bounds on
expected type-1 and type-2 rewards are $C_1 = 0.3$ and $C_2 = 0.1$. There are six labelings, which are listed
below. Labeling (v) has L(a) = 3, L(b) = 1 and L(c) = 2, for instance.



    state            a   b   c
    labeling (i)     1   2   3
    labeling (ii)    1   3   2
    labeling (iii)   2   1   3
    labeling (iv)    2   3   1
    labeling (v)     3   1   2
    labeling (vi)    3   2   1


For this example, it is optimal to use labeling (i) or (ii) with probability 0.6, to use labeling (iii) or
(v) with probability 0.3 and to use labeling (iv) or (vi) with probability 0.1, since the labelings in each of
these pairs assign label 1 to the same state. But no sequence of three
labelings, one from each pair, is adjacent. For instance, labelings (i) and (iii) are adjacent to each other, but
neither is adjacent to labeling (iv) or (vi).
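
The example is small enough to check by enumeration. The sketch below represents each labeling by the order of the states from label 1 to label 3 (an illustrative encoding), groups the labelings by the state they play first, and searches for a sequence of three labelings, one per group, in which each is adjacent to the next.

```python
from itertools import permutations

REWARD = {"a": (1, 0, 0), "b": (0, 1, 0), "c": (0, 0, 1)}
ORDERS = list(permutations("abc"))     # order[p] is the state with label p + 1


def adjacent(order_1, order_2):
    """True if the orders differ by exchanging the states with labels k and k + 1."""
    diff = [p for p in range(3) if order_1[p] != order_2[p]]
    return len(diff) == 2 and diff[1] == diff[0] + 1


def expected_reward(play_probability):
    """Reward vector when state s is played first with probability play_probability[s]."""
    return tuple(
        sum(play_probability[s] * REWARD[s][w] for s in "abc") for w in range(3)
    )


def adjacent_chains():
    """All sequences x, y, z with one labeling per first-played state and x~y, y~z."""
    return [
        (x, y, z)
        for x in ORDERS for y in ORDERS for z in ORDERS
        if {x[0], y[0], z[0]} == {"a", "b", "c"} and adjacent(x, y) and adjacent(y, z)
    ]
```

Playing a, b and c first with probabilities 0.6, 0.3 and 0.1 meets both lower bounds with equality, and `adjacent_chains()` returns an empty list, in agreement with the claim that no such sequence of adjacent labelings exists.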

9 Relaxing Hypothesis C 

The model with a risk-averse exponential utility function can be generalized by replacing Hypothesis RA 
with these conditions: 

• Each bandit k has a transition rate matrix $q^k$ that is nonnegative.

• At least one bandit k has a transition rate matrix $q^k$ that is transient.

• Every closed communicating class C of states in any bandit n has spectral radius of $(q^n)_{CC}$ that
exceeds 1.

When Hypothesis RA is weakened in this way, the analysis becomes more intricate. One difficulty stems
from the fact that if a policy $\pi$ has a transition rate matrix $Q^\pi$ that is not transient, its utility vector $V^\pi$ cannot
satisfy (2.13). The fact that the risk-averse exponential utility function has u(0) = -1 and the weakened
hypothesis can be used to work around this difficulty by ruling out any stationary policy that plays a bandit
at each state in any closed communicating class. A second difficulty arises from the fact that the interchange
argument in Proposition 5.1 can no longer rest on the classic results in [5] or [21]. The interested reader is
referred to the analysis in [6] and to the characterization of optimal policies in [9].

The linear-utility model can be generalized in a similar way. It suffices that each bandit has a matrix
$q^k$ that is substochastic, that at least one bandit k has a matrix $q^k$ that is transient, and that every closed
communicating class of states in any bandit has a gain rate that is negative.

With each of these generalizations, only a minor change is required in the computation. The change is
to avoid playing bandit b(j) if at some point in the computation it has transition rate q(j,j) that equals or
exceeds 1.






10 Acknowledgements 

The authors are pleased to acknowledge that this paper has benefited immensely from the reactions of Dr. 
Pelin Cambolat to earlier drafts. The contribution of the second author has been supported in part by NSF 
grant CMMI-0928490. The contribution of the third author has been supported in part by ISF (Israel Science
Foundation) grant 901/10.

References 

[1] Altman, E. 1999. Constrained Markov Decision Processes. Chapman & Hall/CRC, Boca Raton, USA. 

[2] Bergemann, D., J. Välimäki. 2008. Bandit problems. S. Durlauf, L. Blume, eds. The New Palgrave
Dictionary of Economics (2nd edition).

[3] Berry, D. A., B. Fristedt. 1985. Bandit Problems. Chapman and Hall.

[4] Bertsimas, D., J. Niño-Mora. 1993. Conservation laws, extended polymatroids and multi-armed bandit
problems: a polyhedral approach to indexable systems. Mathematics of Operations Research 21, 257-306.

[5] Denardo, E.V. 1967. Contraction mappings in the theory underlying dynamic programming. SIAM 
Review 9, 165-177. 

[6] Denardo, E.V., H. Park, U.G. Rothblum. 2007. Risk-sensitive and risk-neutral multiarmed bandits. 
Mathematics of Operations Research 32, 374-394. 

[7] Denardo, E. V., U.G. Rothblum. 2006. A turnpike theorem for a risk-sensitive Markov decision problem 
with stopping. SIAM J. Control Optim. 45, 414-431. 

[8] El Karoui, N., I. Karatzas. 1994. Dynamic allocation indices in continuous time. Annals of Applied 
Probability 4, 255-286. 

[9] Feinberg, E.A., U.G. Rothblum. 2011. Splitting randomized stationary policies in total-reward Markov
decision processes. Mathematics of Operations Research, to appear.

[10] Gittins, J.C. 1979. Bandit problems and dynamic allocation indices (with discussion). Journal of the 
Royal Statistical Society B. 41, 148-177. 

[11] Gittins, J.C. 1989. Multi-armed bandit allocation indices. John Wiley and Sons Inc. 

[12] Gittins, J.C., D.M. Jones. 1974. A dynamic allocation index for the sequential design of experiments. 
J. Gani, K. Sarkadi, I. Vincze, eds. Progress in Statistics, European Meeting of Statisticians I, North 
Holland, Amsterdam, 241-266. 

[13] Gittins, J.C., K. Glazebrook, R. Weber. 2011. Multi-armed bandit allocation indices (2nd edition). John 
Wiley and Sons Inc. 

[14] Kaspi, H., A. Mandelbaum. 1998. Multi-armed bandits in discrete and continuous time. Annals of 
Applied Probability 8, 1270-1290. 






[15] Katehakis, M., A.F. Veinott, Jr. 1987. The multiarmed bandit problem: Decomposition and computation. 
Mathematics of Operations Research 22, 262-268. 

[16] Niño-Mora, J. 2007. A (2/3)n^3 fast-pivoting algorithm for the Gittins index and optimal stopping of a 
Markov chain. INFORMS Journal on Computing 10, 596-606. 

[17] Schlag, K. 1998. Why imitate, and if so, how? A bounded rational approach to multi-armed bandits. 
Journal of Economic Theory 78, 130-156. 

[18] Sonin, I. 2008. A generalized Gittins index for Markov chains and its recursive calculation. Technical 
Report. Statistics and Probability Letters 78, 1526-1533. 

[19] Tsitsiklis, J. 1994. A short proof of the Gittins index theorem. Annals of Applied Probability 4, 194-199. 

[20] Varaiya, P., J. Walrand, C. Buyukkoc. 1985. Extensions of the multi-armed bandit problem: The 
discounted case. IEEE Trans. Automat. Control AC-30, 426-439. 

[21] Veinott, A.F., Jr. 1969. Discrete dynamic programming with sensitive discount optimality criteria. Ann. 
Math. Statist. 40, 1635-1660. 

[22] Weber, R. 1992. On the Gittins index for multiarmed bandits. Annals of Applied Probability 2, 1024-1033. 

[23] Weiss, G. 1988. Branching bandit processes. Probability in the Engineering and Informational Sciences 
2, 269-278. 

[24] Whittle, P. 1980. Multi-armed bandits and the Gittins index. J. Roy. Statist. Soc. B 43, 143-149. 






